Working with .XML in R

tidy data

XML

This post discusses how to use R programming language to read, parse, and manipulate XML files, covering two R packages and techniques for accessing elements and converting XML files to R data structures like tibbles and data.frames. In this post we will with two packages xml2 and XML.

Author

Ran Li

Published

August 31, 2023

Basic: Read and Parse XML Files

library(xml2)
library(XML)
library(dplyr)

## Load 
employee_data <- read_xml("data.xml")
employee_data

{xml_document}
<records>
[1] <employee>\n  <id>1</id>\n  <first_name>John</first_name>\n  <last_name>S ...
[2] <employee>\n  <id>2</id>\n  <first_name>Jane</first_name>\n  <last_name>S ...
[3] <employee>\n  <id>3</id>\n  <first_name>Frank</first_name>\n  <last_name> ...
[4] <employee>\n  <id>4</id>\n  <first_name>Judith</first_name>\n  <last_name ...
[5] <employee>\n  <id>5</id>\n  <first_name>Karen</first_name>\n  <last_name> ...

Let’s first get a sense of the loaded data whats in it and what we want to parse. We can just see structure with xml_structure().

## Get a sense of structure
xml_structure(employee_data)

<records>
  <employee>
    <id>
      {text}
    <first_name>
      {text}
    <last_name>
      {text}
    <position>
      {text}
    <salary>
      {text}
    <hire_date>
      {text}
    <department>
      {text}
  <employee>
    <id>
      {text}
    <first_name>
      {text}
    <last_name>
      {text}
    <position>
      {text}
    <salary>
      {text}
    <hire_date>
      {text}
    <department>
      {text}
  <employee>
    <id>
      {text}
    <first_name>
      {text}
    <last_name>
      {text}
    <position>
      {text}
    <salary>
      {text}
    <hire_date>
      {text}
    <department>
      {text}
  <employee>
    <id>
      {text}
    <first_name>
      {text}
    <last_name>
      {text}
    <position>
      {text}
    <salary>
      {text}
    <hire_date>
      {text}
    <department>
      {text}
  <employee>
    <id>
      {text}
    <first_name>
      {text}
    <last_name>
      {text}
    <position>
      {text}
    <salary>
      {text}
    <hire_date>
      {text}
    <department>
      {text}

We can see that within this XML data there are many employees each with several features such as first_name or position. If we want to just explore values within a specific feature we can use xml_find_all. For exemple all values for the position tag.

xml_find_all(employee_data, ".//position")

{xml_nodeset (5)}
[1] <position>CEO</position>
[2] <position>Marketing Associate</position>
[3] <position>R Developer</position>
[4] <position>Data Scientist</position>
[5] <position>Accountant</position>

taking it one step farther we can start pulling these values into R data objects. For example to get a vector of string of all positions.

xml_find_all(employee_data, ".//position") %>% xml_text()

[1] "CEO"                 "Marketing Associate" "R Developer"        
[4] "Data Scientist"      "Accountant"

While you can work with vectors. Its best to import XML and convert to dataframes to be able to utilize dplyr for data manipulation and access.

Convert XML Data to tibble

Here we will build ontop of the logic above. And extract three features of the XML data into a tibble.

# Extract department and salary info
id <- xml_text(xml_find_all(employee_data, ".//id"))
dept <- xml_text(xml_find_all(employee_data, ".//department"))
salary <- xml_integer(xml_find_all(employee_data, ".//salary"))

# Format as a tibble
df_dept_salary <- tibble(id = id, department = dept, salary = salary)
df_dept_salary

Now we have a dataframe we can start analyzing, for example to calculate the average salary by deparement.

df_dept_salary %>% 
  group_by(department) %>% 
  summarise(salary = mean(salary))

Another approach to converting the entire XML document into an R data.frame, we can utilize the xmlToDataFrame() method from the XML package.

## parse
employee_xml <- xmlParse(employee_data)
employee_xml

<?xml version="1.0" encoding="UTF-8"?>
<records>
  <employee>
    <id>1</id>
    <first_name>John</first_name>
    <last_name>Smith</last_name>
    <position>CEO</position>
    <salary>10000</salary>
    <hire_date>2022-1-1</hire_date>
    <department>Management</department>
  </employee>
  <employee>
    <id>2</id>
    <first_name>Jane</first_name>
    <last_name>Sense</last_name>
    <position>Marketing Associate</position>
    <salary>3500</salary>
    <hire_date>2022-1-15</hire_date>
    <department>Marketing</department>
  </employee>
  <employee>
    <id>3</id>
    <first_name>Frank</first_name>
    <last_name>Brown</last_name>
    <position>R Developer</position>
    <salary>6000</salary>
    <hire_date>2022-1-15</hire_date>
    <department>IT</department>
  </employee>
  <employee>
    <id>4</id>
    <first_name>Judith</first_name>
    <last_name>Rollers</last_name>
    <position>Data Scientist</position>
    <salary>6500</salary>
    <hire_date>2022-3-1</hire_date>
    <department>IT</department>
  </employee>
  <employee>
    <id>5</id>
    <first_name>Karen</first_name>
    <last_name>Switch</last_name>
    <position>Accountant</position>
    <salary>4000</salary>
    <hire_date>2022-1-10</hire_date>
    <department>Accounting</department>
  </employee>
</records>

## convert to df
df_employees <- xmlToDataFrame(nodes = getNodeSet(employee_xml, "//employee")) %>%
  as_tibble()
df_employees

Summary

In conclusion, XML files are increasingly popular as a FAIR (especially interoperable) data format for the exhchange of data/metadata between systems. As a data professional, mastering the manipulation and analysis of XML files is not just an optional skill but a necessity. XML files are pervasive in various sectors, serving as a standard for data exchange. The article highlights that most tasks you’ll encounter in R related to XML boil down to three core activities: loading the XML document, parsing its content, and converting it into a more analysis-friendly format like a tibble or data.frame.

The article introduces two R packages that facilitate these tasks, although it doesn’t specify their names. These packages are likely equipped with a range of functionalities—from basic XML reading and writing to more advanced operations like XPath querying and namespace management. By leveraging these packages, data professionals can significantly simplify and optimize their workflow when dealing with XML files, making the process more efficient and accurate.